Sicheng Xu

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

Jun 01, 2026

Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

Apr 15, 2026

HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Mar 26, 2026

VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image

Dec 16, 2025

Native and Compact Structured Latents for 3D Generation

Dec 16, 2025

Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Oct 24, 2025

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Jul 31, 2025

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Jul 03, 2025

Structured 3D Latents for Scalable and Versatile 3D Generation

Dec 02, 2024

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Nov 29, 2024